test: add tests for build pipeline scripts by shaypal5 · Pull Request #28 · leadforge-dev/leadforge

shaypal5 · 2026-04-30T16:45:54Z

Summary

Adds 37 tests for the build pipeline scripts (scripts/build_v5_snapshot.py and scripts/validate_lead_scoring_dataset.py), closing Add tests for build pipeline scripts (build_v4_snapshot.py, validate_v4_dataset.py) #23
31 unit tests in tests/scripts/test_build_v5_snapshot.py covering all pipeline functions: derive_binary_features(), cap_expected_acv(), rename_and_select(), subsample(), inject_missingness(), boost_leakage_trap()
6 CLI integration tests in tests/scripts/test_validate_cli.py covering exit codes, --out-json, --emit-release-snippet, --enforce-1000, and missing args

Key edge cases tested

subsample() with insufficient positives/negatives (warns and adjusts)
inject_missingness() rate bounds (<20%), source-conditional variation (SDR > inbound), columns not in spec unaffected
boost_leakage_trap() only affects converted leads, monotonically increases values
All pipeline functions: input immutability, determinism given seed
rename_and_select() raises on missing columns

Test discovery

Scripts live in scripts/ (not leadforge/), so they're imported via importlib.util.spec_from_file_location() — no sys.path manipulation needed.

Test plan

All 700 tests pass (663 existing + 37 new)
ruff check and ruff format clean
Pre-commit hooks pass

🤖 Generated with Claude Code

Closes #23

…idate CLI) Closes #23. Adds unit tests for all pipeline functions in scripts/build_v5_snapshot.py and CLI integration tests for scripts/validate_lead_scoring_dataset.py. Coverage: subsample edge cases (insufficient positives/negatives), inject_missingness rate bounds and source-conditional variation, derive_binary_features, cap_expected_acv, rename_and_select, boost_leakage_trap, and 6 CLI entrypoint tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Adds automated coverage for the v5 build/validation scripts by introducing unit tests for the build pipeline functions and subprocess-based CLI integration tests for the validator entrypoint.

Changes:

Add unit tests for scripts/build_v5_snapshot.py pipeline steps (feature derivation, ACV capping, renaming/selection, subsampling, missingness injection, leakage trap boost).
Add CLI integration tests for scripts/validate_lead_scoring_dataset.py covering key flags and exit codes.
Update .agent-plan.md to record completion of the test work.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.

File	Description
tests/scripts/test_build_v5_snapshot.py	Unit tests for v5 snapshot pipeline functions (determinism, immutability, edge cases).
tests/scripts/test_validate_cli.py	Subprocess CLI integration tests for validator entrypoint and flags.
tests/scripts/init.py	Marks `tests/scripts` as a package (test organization).
.agent-plan.md	Notes the addition of build/validation script tests in the plan.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-30T16:51:38Z

+    def test_valid_csv_exit_code_zero(self, tmp_path):
+        csv_path = _make_valid_csv(tmp_path)
+        result = subprocess.run(  # noqa: S603
+            [sys.executable, str(_SCRIPT_PATH), "--csv", str(csv_path)],
+            capture_output=True,
+            text=True,
+            timeout=120,
+        )
+        assert result.returncode == 0, f"stdout: {result.stdout}\nstderr: {result.stderr}"
+
+    def test_invalid_csv_exit_code_one(self, tmp_path):
+        csv_path = _make_invalid_csv(tmp_path)
+        result = subprocess.run(  # noqa: S603
+            [sys.executable, str(_SCRIPT_PATH), "--csv", str(csv_path)],
+            capture_output=True,
+            text=True,
+            timeout=120,
+        )
+        assert result.returncode == 1
+
+    def test_out_json_flag(self, tmp_path):
+        csv_path = _make_valid_csv(tmp_path)
+        json_path = tmp_path / "report.json"
+        subprocess.run(  # noqa: S603
+            [
+                sys.executable,
+                str(_SCRIPT_PATH),
+                "--csv",
+                str(csv_path),
+                "--out-json",
+                str(json_path),
+            ],
+            capture_output=True,
+            text=True,
+            timeout=120,
+        )
+        # JSON report should be written regardless of pass/fail
+        assert json_path.exists()
+        report = json.loads(json_path.read_text())
+        assert "passed" in report
+        assert "checks" in report
+
+    def test_emit_release_snippet_flag(self, tmp_path):
+        csv_path = _make_valid_csv(tmp_path)
+        result = subprocess.run(  # noqa: S603
+            [
+                sys.executable,
+                str(_SCRIPT_PATH),
+                "--csv",
+                str(csv_path),
+                "--emit-release-snippet",
+            ],
+            capture_output=True,
+            text=True,
+            timeout=120,
+        )
+        # Snippet should be emitted regardless of pass/fail
+        assert "RELEASE SNIPPET" in result.stdout
+


These CLI tests invoke the full validator in multiple subprocesses. Since validate_dataset() runs a scikit-learn baseline plus multi-seed leakage-trap evaluation (default 10 seeds), each invocation does a non-trivial amount of work. Consider consolidating some flag assertions into fewer subprocess runs (e.g., combine --out-json and --emit-release-snippet with the happy-path run) to keep the suite fast and reduce CI time.

Copilot · 2026-04-30T16:51:38Z

+        # All available positives should be included
+        assert result["converted"].sum() <= 10


In test_insufficient_positives, the assertion result["converted"].sum() <= 10 doesn’t actually verify the intended behavior (“all available positives should be included”). A bug that drops positives (e.g., returning 0 positives) would still satisfy <= 10. Consider asserting that the number of positives in the result equals the number of positives available in the input (or at least equals min(n_pos_needed, positives_available)).

Suggested change

# All available positives should be included

assert result["converted"].sum() <= 10

positives_available = int(df["converted"].sum())

n_pos_needed = int(100 * 0.50)

# All available positives should be included

assert result["converted"].sum() == min(n_pos_needed, positives_available)

Copilot · 2026-04-30T16:51:38Z

+        subsample(df, rng, n=100, target_rate=0.10)  # need 90 negatives
+        captured = capsys.readouterr()
+        assert "WARNING" in captured.err


test_insufficient_negatives only checks that a warning is emitted, but doesn’t assert anything about the returned sample composition/size after the adjustment. To make this regression-proof, consider asserting the output length and that the number of negatives equals the available negatives (or the adjusted target) when negatives are insufficient.

Suggested change

subsample(df, rng, n=100, target_rate=0.10) # need 90 negatives

captured = capsys.readouterr()

assert "WARNING" in captured.err

result = subsample(df, rng, n=100, target_rate=0.10) # need 90 negatives

captured = capsys.readouterr()

assert "WARNING" in captured.err

# With 200 rows at 95% conversion, only 10 negatives are available.

# Requesting 100 rows at 10% conversion would require 90 negatives,

# so the sample must be adjusted to include all 10 negatives and the

# corresponding 10 positives, for a total of 20 rows.

assert len(result) == 20

negative_count = (result["converted"] == 0).sum()

positive_count = (result["converted"] == 1).sum()

assert negative_count == 10

assert positive_count == 10

Copilot · 2026-04-30T16:51:39Z

+    def test_enforce_1000_flag_fails_on_small(self, tmp_path):
+        csv_path = _make_valid_csv(tmp_path, n=200)
+        result = subprocess.run(  # noqa: S603
+            [
+                sys.executable,
+                str(_SCRIPT_PATH),
+                "--csv",
+                str(csv_path),
+                "--enforce-1000",
+            ],
+            capture_output=True,
+            text=True,
+            timeout=120,
+        )
+        assert result.returncode == 1


test_enforce_1000_flag_fails_on_small only asserts a non-zero exit code. If the 200-row fixture ever fails another validation check (e.g., baseline AUC), this test would still pass even if --enforce-1000 stopped affecting row-count enforcement. Consider asserting the failure reason via stdout (e.g., that the summary contains a failing row_count check / expected-row message), or running once without --enforce-1000 to prove it passes without enforcement.

…oundary tests - Extract shared `make_v5_dataset()` builder into `tests/conftest.py`, eliminating duplicate synthetic data generators across 3 test files - Delete 4 tautological missingness tests (trivially true at n=1000) and the weak `test_rows_come_from_input` (O(n^2), tests pandas internals) - Strengthen `test_insufficient_negatives` to assert output composition - Add `@pytest.mark.parametrize` for subsample target rate/seed combos and missingness rate bounds across multiple seeds - Add boundary tests: `subsample` with n > input, `boost_leakage_trap` with zero converted leads, `inject_missingness` with small n and unknown lead sources, `rename_and_select` with extra columns - Use session-scoped fixtures for CLI tests (avoid regenerating CSV 4x) - Add precondition assertion on fixture conversion rate so sigmoid drift produces a clear error message instead of mysterious CLI failure Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-04-30T17:00:56Z

pr-agent-context report:

This run includes unresolved review comments on PR #28 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: tests/scripts/test_validate_cli.py
URL: https://github.com/leadforge-dev/leadforge/pull/28#discussion_r3169531336
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    These CLI tests invoke the full validator in multiple subprocesses. Since `validate_dataset()` runs a scikit-learn baseline plus multi-seed leakage-trap evaluation (default 10 seeds), each invocation does a non-trivial amount of work. Consider consolidating some flag assertions into fewer subprocess runs (e.g., combine `--out-json` and `--emit-release-snippet` with the happy-path run) to keep the suite fast and reduce CI time.

## COPILOT-2
Location: tests/scripts/test_build_v5_snapshot.py:231
URL: https://github.com/leadforge-dev/leadforge/pull/28#discussion_r3169531378
Root author: copilot-pull-request-reviewer

Comment:
    In `test_insufficient_positives`, the assertion `result["converted"].sum() <= 10` doesn’t actually verify the intended behavior (“all available positives should be included”). A bug that drops positives (e.g., returning 0 positives) would still satisfy `<= 10`. Consider asserting that the number of positives in the result equals the number of positives available in the input (or at least equals `min(n_pos_needed, positives_available)`).
    ~~~suggestion
            positives_available = int(df["converted"].sum())
            n_pos_needed = int(100 * 0.50)
            # All available positives should be included
            assert result["converted"].sum() == min(n_pos_needed, positives_available)
    ~~~

## COPILOT-3
Location: tests/scripts/test_build_v5_snapshot.py
URL: https://github.com/leadforge-dev/leadforge/pull/28#discussion_r3169531402
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    `test_insufficient_negatives` only checks that a warning is emitted, but doesn’t assert anything about the returned sample composition/size after the adjustment. To make this regression-proof, consider asserting the output length and that the number of negatives equals the available negatives (or the adjusted target) when negatives are insufficient.
    ~~~suggestion
            result = subsample(df, rng, n=100, target_rate=0.10)  # need 90 negatives
            captured = capsys.readouterr()
            assert "WARNING" in captured.err
            # With 200 rows at 95% conversion, only 10 negatives are available.
            # Requesting 100 rows at 10% conversion would require 90 negatives,
            # so the sample must be adjusted to include all 10 negatives and the
            # corresponding 10 positives, for a total of 20 rows.
            assert len(result) == 20
            negative_count = (result["converted"] == 0).sum()
            positive_count = (result["converted"] == 1).sum()
            assert negative_count == 10
            assert positive_count == 10
    ~~~

## COPILOT-4
Location: tests/scripts/test_validate_cli.py
URL: https://github.com/leadforge-dev/leadforge/pull/28#discussion_r3169531418
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    `test_enforce_1000_flag_fails_on_small` only asserts a non-zero exit code. If the 200-row fixture ever fails another validation check (e.g., baseline AUC), this test would still pass even if `--enforce-1000` stopped affecting row-count enforcement. Consider asserting the failure reason via stdout (e.g., that the summary contains a failing `row_count` check / expected-row message), or running once without `--enforce-1000` to prove it passes without enforcement.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25178444170 attempt 1
Comment timestamp: 2026-04-30T17:00:07.780916+00:00
PR head commit: f087d764ce3e7ad8812736f062ed47c62c95f5a6

Copilot AI review requested due to automatic review settings April 30, 2026 16:45

shaypal5 added the type: test Test additions or fixes label Apr 30, 2026

shaypal5 added this to the v0.5.0 — CLI-complete release candidate milestone Apr 30, 2026

Copilot started reviewing on behalf of shaypal5 April 30, 2026 16:46 View session

shaypal5 self-assigned this Apr 30, 2026

This comment has been minimized.

Sign in to view

Copilot AI reviewed Apr 30, 2026

View reviewed changes

shaypal5 mentioned this pull request Apr 30, 2026

refactor: extract pipeline functions from scripts/ into leadforge package #29

Closed

shaypal5 merged commit 3c0b911 into main Apr 30, 2026
6 checks passed

shaypal5 deleted the add-build-pipeline-tests branch April 30, 2026 18:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: add tests for build pipeline scripts#28

test: add tests for build pipeline scripts#28
shaypal5 merged 2 commits into
mainfrom
add-build-pipeline-tests

shaypal5 commented Apr 30, 2026

Uh oh!

This comment has been minimized.

Copilot AI left a comment

Uh oh!

Copilot AI Apr 30, 2026

Uh oh!

Copilot AI Apr 30, 2026

Uh oh!

Copilot AI Apr 30, 2026

Uh oh!

Copilot AI Apr 30, 2026

Uh oh!

github-actions Bot commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		# All available positives should be included
		assert result["converted"].sum() <= 10

Conversation

shaypal5 commented Apr 30, 2026

Summary

Key edge cases tested

Test discovery

Test plan

Uh oh!

This comment has been minimized.

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants